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Abstract 

We introduce the bilingual dual-coding theory as a 
model for bilingual mental representation. Based 
on this model, lexical selection neural networks are 
implemented for a connectionist transfer project in 
machine translation. 



Introduction 



Psycholinguistic knowledge would be greatly help- 
ful, as we believe, in constructing an artificial lan- 
guage processing system. As for machine trans- 
lation, we should take advantage of our under- 
standings of (1) how the languages are represented 
in human mind; (2) how the representation is 
mapped from one language to another; (3) how 
the representation and mapping are acquired by 
human. 



The bilingual dual-coding theory (Paivio 
1986| ) partially answers the above questions. It 
depicts the verbal representations for two different 
languages as two separate but connected logogen 
systems, characterizes the translation process as 
the activation along the connections between the 
logogen systems, and attributes the acquisition of 
the representation to some unspecified statistical 
processes. 

We have explored an information theoretical 



neural network (Gorin and Levinson, 1989) that 



can acquire the verbal associations in the dual- 
coding theory. It provides a learnable lexical 
selection sub-system for a connectionist transfer 
project in machine translation. 

Dual-Coding Theory 

There is a well-known debate in psycholinguis- 
tics concerning the bilingual mental representa- 
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tion: independence position assumes that bilin- 
gual memory is represented by two functionally in- 
dependent storage and retrieval systems, whereas 
interdependence position hypothesizes that all in- 
formation of languages exists in a common mem- 
ory store. Studies on cross-language transfer and 
cross-language priming have provided evidence for 



bert, 195? 



both hypotheses (de Groot and Nas, 1991; Lam- 



Dual-coding theory explains the coexistence of 
independent and interdependent phenomena with 
separate but connected structures. The general 
dual-coding theory hypothesizes that human rep- 
resents language with dual systems — the verbal 
system and the imagery system. The elements 
of the verbal system are logogens for words in a 
language. The elements of the imagery system, 
called "imagens" , are connected to the logogens in 
the verbal systems via referential connections. Lo- 
gogens in a verbal system are also interconnected 
with associative connections. The bilingual dual- 
coding theory proposes an architecture in which 
a common imagery system is connected to two 
verbal systems, and the two verbal systems are 
interconnected to each other via associative con- 
nections [Figure |l[. Unlike the within-language 
associations, which are rich and diverse, these 
between-language associations involve primarily 
translation equivalent terms that are experienced 
together frequently. The interconnections among 
the three systems explain the interdependent func- 
tional behavior. On the other hand, the different 
characteristics of within-language and between- 
language associations account for the independent 
functional behavior. 

Based on the above structural assumption, 
dual-coding theory proposes a parallel set of pro- 
cessing assumptions. Activation of connections 
between referentially related imagens and logogens 
is called referential processing. Naming objects 
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Figure 1: Bilingual Dual-Coding Representation 

and imaging to words are prototypical examples. 
Activation of associative connections between lo- 
gogens is called associative processing. Lexical 
translation is an example of associative process- 
ing between two languages. 

Connectionist Lexical Selection 
Lexical Selection 

Lexical selection is the task of choosing target lan- 
guage words that accurately reflect the meaning of 
. the corresponding source language words , It plays 



an important role in machine translation (Puste . 



jovsky and Nirenburg, 1987) 



A common lexical selection practice involves 
an intermediate representation. It disambiguates 
the source language words to entities in the inter- 
mediate representation, then maps from the en- 
tities to the target lexical entries. This inter- 
mediate representation may be Lexical Concept 



Structure (Dorr, 1989) or intcrlingua (Nirenberg 



1987). This engineering approach requires great 



effort in designing the representation and the map- 
ping rules. 

Currently, there are some efforts in statis- 
tical lexical selection. A target language word 
Wi can be selected with the posterior probability 
Pr(W^| Ws) given the source language word Ws- 
Several target language lexical entries may be se- 
lected for a single source language word. Then the 
correct selections can be identified by the language 



model of the target language ( Brown, 1990 ). This 
approach is learnable. However, the accuracy is 
low. One reason is that it does not use any struc- 
tural information of a language. 

In next subsections, we propose information- 
theoretical networks based on the bilingual dual- 
coding theory for lexical selection. 

Information-Theoretical Networks 

Information-theoretical network is a neural net- 
work formalism that is capable of doing associa- 
tions between two layers of representations. The 
associations can be obtained statistically accord- 
ing to the network's experiences. 

An information-theoretical network has two 
layers. Each unit of a layer represents an element 
in the input or output of a training pattern, which 
might be a logogen or a word. Units in different 
layers are connected. The weight of the connec- 
tion between unit i in one layer and unit j in the 
other layer is assigned with the mutual informa- 
tion between the elements represented by the two 
units 

(1) w l3 = I{Vi,Vj) = \og{Pr{v 3 v t )/Pr{v,)) [j 

Each layer also contains a bias unit, which is 
always activated. The weight of the connection 
between the bias unit in one layer and unit j in 
the other layer is 

(2) w Qj = log Pr( Vj ) 

Both the information-theoretical network and 
the back-propagation network compute the pos- 



terior probabilities for an association task (Gorin 



and Levinson, 1989| ; Robinson, 1992j ). However 



only the information-theoretical network is iso- 
morphic to the directly interconnected verbal sys- 
tems in the dual-coding theory. Besides, an 
information-theoretical network has the following 
advantages: (1) it learns fast. The network can 
learn in a single pass without gradient decent. 
(2) it is adaptive. It can incrementally adapt to 
new experiences simply by adding new data to the 
training samples and modifying the associations 
according to the changed statistics. These make 
the network more psychologically plausible. 

Lexical Selection as an Associative 
Process 

We tried to map source language f-structures 
to target language f-structure in a connectionist 
transfer project (Wang, 1994). Functionally, there 



Where Vi means the event that unit i is activated. 



were two sub-tasks: 1. finding the target sub- 
structures, their phrasal categories and their cor- 
responding source structures; 2. finding the head 
of a target structure. The second sub-task is a 
problem of lexical selection, ft was first imple- 
mented with a back-propagation network. 

We replaced the back-propagation networks 
for lexical selection with information-theoretical 
networks simulating the associative process in the 
dual-coding theory. The networks have two lay- 
ers of units. Each source (target) language lexical 
item is represented by a unit in the input (output) 
layer. One network is constructed for each phrasal 
category (NP, VP, AP, etc.). 

The networks works in the following way: for 
a target-language f-structure to be generated, the 
transfer system knows its phrasal category and 
its corresponding source-language f-structure from 
the networks that perform the sub-task 1. It 
then activates the lexical selection network for 
that phrasal category with the input units that 
correspond to the heads of the source language 
f-structure and its sub-structures. Through the 
connections between the two layers, the output 
units are activated, and the lexical item that cor- 
responds to the most active output unit is selected 
as the head of the target f-structure. The fol- 
lowing example illustrates how the system selects 
the head anmelden for the German XCOMP sub- 
structure when it does the transfer from 

[sentence [subj *] Would [xcomp [subj f\ like [xcomp 

[subj A register [ pp - a dj for the conference]}]] to 



[subj Ich] werde 



xcomp [subj 



Ich] 



adj 



gerne] anmelden [ pp - a dj fuer der Konferenz]]^. 

Since the structure networks find that there is 
a VP sub-structure of XCOMP in the target struc- 
ture whose corresponding input structure is [xcomp 
[subj I] to register [ PP - a dj for the conference]]], it 
activates the VP lexical selection network's input 
units for I, register and conference. By propagat- 
ing the activation via the associative connections, 
the unit for anmelden is the most active output. 
Therefore, anmelden is chosen as the head of the 
xcomp sub-structure. 

Preliminary Result 

The domain of our work was the Conference Regis- 
tration Telephony Conversations. The lexicon for 
the task contained about 500 English and 500 Ger- 
man words. There were 300 English/ German f- 

2 The f-structures are simplified here for the sake of 
conciseness. 



structure pairs available from other research tasks 
(Ostcrholtz, 1992). A separate set of 154 senten- 
tial f-structures was used to test the generalization 
performance of the system. The testing data was 
collected for an independent task (Jain, 1991). 

From the 300 sentential f-structure pairs, ev- 
ery German VP sub-structure is extracted and 
labeled with its English counterpart. The En- 
glish counterpart's head and its immediate sub- 
structures' heads serve as the input in a sample 
of VP association, and the German f-structure's 
head become the output of the association. For 
the above example, the association {[input I, reg- 
ister, conference] [output anmelden]) is a sample 
drawn from the f-structures for the VP network. 
The training samples for all the other networks are 
created in the same way. 

The accuracy of our system with information- 
theoretical network lexical selection is lower than 
the one with back-propagation networks (around 
84% versus around 92%) for the training data. 
However, the generalization performance on the 
unseen inputs is better (around 70% versus around 
62%). The information-theoretical networks do 
not over-learn as the back-propagation networks. 
This is partially due to the reduced number of 
free parameters in the information-theoretical net- 
works. 

Summary 

The lexical selection approach discussed here has 
two advantages. First, it is lcarnablc. Little hu- 
man effort on knowledge engineering is required. 
Secondly, it is psycholinguistically well-founded in 
that the approach adopts a local activation pro- 
cessing model instead of relies upon symbol pass- 
ing, as symbolic systems usually do. 
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